ggplot2 is a popular package for visualizing data in R. It’s been around for years and has pretty good documentation and tons of example code around the web. This lesson will introduce you to the basic components of working with ggplot2.
R provides many ways to get your data into a plot. Three common ones are base graphics (plot, hist, etc.), lattice, and ggplot2.
All of them work. I use base graphics for simple, quick-and-dirty plots. I use ggplot2 for most everything else.
ggplot2 excels at making complicated plots easy and easy plots simple enough.
More: http://ggplot2.tidyverse.org/reference/index.html#section-layer-geoms
Every graphic you make in ggplot2 will have at least one aesthetic and at least one geom. The aesthetic maps your data to your geometry. Your geometry specifies the type of plot we’re making (point, bar, etc.).
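Here is a minimal sketch of that idea. It assumes the mpg dataset that ships with ggplot2 and its displ (engine displacement) and hwy (highway mileage) columns. The aesthetic maps those columns to x and y, and the geom draws each row as a point:

```r
library(ggplot2)

# Aesthetic: map displ to x and hwy to y (mpg is an assumed example dataset).
# Geom: draw one point per row.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
p
```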
What makes ggplot really powerful is how quickly we can make this plot visualize more of our data.
Coloring each point by class (compact, van, pickup, etc.) is just a quick extra bit of code:
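A sketch of what that extra bit could look like, again assuming the mpg dataset bundled with ggplot2:

```r
library(ggplot2)

# Adding color = class inside aes() maps the class column to point color
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
p
```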
Aside: How did I know to write color = class? aes passes its arguments on to any geoms you use, and we can find out which aesthetic mappings geom_point takes with ?geom_point (see the “Aesthetics” section).
What if we just wanted the color of the points to be blue?
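A tempting first try, sketched here on the assumed mpg data, is to put the color inside aes:

```r
library(ggplot2)

# This MAPS the constant string "blue" as if it were a data value,
# so the points get the default palette color and a legend entry named "blue"
p <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()
p
```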
Well that’s weird. What happened here? This is the difference between setting and mapping in ggplot. The aes function only takes mappings from our data onto our geom. If we want to make all the points blue, we need to set it inside the geom:
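A sketch of the setting version, assuming the same mpg columns as before:

```r
library(ggplot2)

# color is SET as a fixed value on the geom, outside of aes()
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
p
```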
Sizing each point by the number of cylinders is easy:
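For instance, assuming the cyl column of the mpg dataset:

```r
library(ggplot2)

# Map the number of cylinders to point size
p <- ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point()
p
```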
Making separate plots (small multiples) is also quick:
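One way to sketch this is with facet_wrap. Splitting by class is my assumption here; any categorical column would do:

```r
library(ggplot2)

# One panel per vehicle class (the faceting variable is an assumption)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)
p
```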
Let’s make a histogram instead:
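A minimal sketch, assuming we want the distribution of the hwy column in mpg:

```r
library(ggplot2)

# A histogram only needs an x aesthetic; ggplot2 computes the counts for us
p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram()
p
```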
You’ll see a message (red text) about stat_bin and a bins argument. ggplot2 can calculate statistics on our data, such as frequencies. In this case it bins our hwy column with the stat_bin function and falls back to a default of 30 bins, so we should set the bins argument ourselves:
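For example (the value of 10 is just an illustrative choice, not a recommendation):

```r
library(ggplot2)

# Set the number of bins explicitly instead of relying on the default
p <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 10)
p
```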
I’m a big fan of box plots and ggplot2 can plot these too:
Oops, we got a warning:
Warning message: Continuous x aesthetic – did you forget aes(group=…)?
That’s because we need to convert cyl to a factor:
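A sketch of the fix, assuming the plot is a box plot of hwy for each value of cyl in mpg:

```r
library(ggplot2)

# factor(cyl) turns the numeric cylinder count into discrete groups,
# so we get one box per cylinder count
p <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
  geom_boxplot()
p
```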
Another type of visualization I use a lot for visualizing distributions is the violin plot:
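A minimal sketch, reusing the assumed factor(cyl) and hwy mapping:

```r
library(ggplot2)

# Same mapping as a box plot, but the geom draws the density of each group
p <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
  geom_violin()
p
```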
So far we’ve made really simple plots: One geometry per plot. Let’s layer multiple geometries on top of one another to show the raw points on top of the violins:
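Layering is just a matter of adding another geom with +. In this sketch I use geom_jitter rather than geom_point so overlapping points spread out a little; that choice, and the width and alpha values, are assumptions:

```r
library(ggplot2)

# Layers are drawn in the order they are added: violins first, points on top
p <- ggplot(mpg, aes(x = factor(cyl), y = hwy)) +
  geom_violin() +
  geom_jitter(width = 0.1, alpha = 0.5)
p
```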
ggplot2 also helps us do some quick-and-dirty modeling:
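A sketch using geom_smooth on the assumed displ and hwy columns of mpg:

```r
library(ggplot2)

# geom_smooth() fits a smoother to the data and draws it with a confidence band
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
p
```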
geom_smooth() is pretty configurable. Here we set the method to lm instead of the default loess:
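For example, on the same assumed mpg columns:

```r
library(ggplot2)

# method = "lm" fits a straight line (linear model) instead of a loess curve
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")
p
```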
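Plot limits can be controlled in one of three ways: filtering the data before plotting (ggplot2 computes limits from the data range), setting the limits argument on one or both scales, or setting the xlim and ylim arguments in coord_cartesian.
Let’s show this with an example plot: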
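The sketch below assumes the economics dataset bundled with ggplot2, plotting its unemploy column over date. The dataset and columns are my assumption, since the original code isn't shown:

```r
library(ggplot2)

# A time series with default limits taken from the range of the data
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line()
p
```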
Since we’re plotting data where the zero point on the vertical axis means something, maybe we want to start the vertical axis at 0:
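One way to sketch this is with the limits argument on the y scale, still assuming the economics data:

```r
library(ggplot2)

# Start the y axis at 0 and end it at the largest observed value
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  scale_y_continuous(limits = c(0, max(economics$unemploy)))
p
```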
Or maybe we want to zoom in on just the 2000s and beyond:
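A sketch using the limits argument on the date scale; the cutoff date and the assumed economics data are illustrative:

```r
library(ggplot2)

# Scale limits drop rows that fall outside the range before drawing
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  scale_x_date(limits = c(as.Date("2000-01-01"), max(economics$date)))
p
```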
Note the warning message we received:
Warning message: Removed 390 rows containing missing values (geom_path).
That’s normal when data in your input data.frame are outside the range we’re plotting.
Let’s use coord_cartesian instead to change the x and y limits:
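A sketch of the coord_cartesian version, with the same assumed data and the same illustrative limits:

```r
library(ggplot2)

# coord_cartesian zooms the view without removing any rows from the data
p <- ggplot(economics, aes(x = date, y = unemploy)) +
  geom_line() +
  coord_cartesian(
    xlim = c(as.Date("2000-01-01"), max(economics$date)),
    ylim = c(0, max(economics$unemploy))
  )
p
```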
Note the slight difference when using coord_cartesian: ggplot didn’t put a buffer around our values. Sometimes we want this and sometimes we don’t, and it’s good to know the difference.
We’ll use scales in ggplot2 very often. The usual use cases are changing scale limits or changing the way our data are mapped onto our geom.
For example, how do we override the default colors ggplot2 uses here?
Most scales follow the format scale_{aesthetic}_{method}, where aesthetic is one of our aesthetic mappings (such as color, fill, or shape) and method is how the colors, fill colors, or shapes are chosen.
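As a sketch of that naming pattern, scale_color_brewer pairs the color aesthetic with the ColorBrewer palettes. The mpg data and the "Dark2" palette are my choices for illustration:

```r
library(ggplot2)

# scale_{aesthetic}_{method}: aesthetic = color, method = brewer
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")
p
```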
We can also use scales to rescale our data. Here’s some census data, unscaled:
And scaled (log10):
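The census data isn't available here, so this sketch uses a small made-up data frame (pop, with hypothetical place and population columns) as a stand-in to show the log10 scale:

```r
library(ggplot2)

# Hypothetical stand-in for the census data: populations spanning several
# orders of magnitude
pop <- data.frame(
  place = c("A", "B", "C", "D"),
  population = c(500, 12000, 250000, 3900000)
)

# scale_y_log10() transforms the y values before they are drawn
p <- ggplot(pop, aes(x = place, y = population)) +
  geom_point() +
  scale_y_log10()
p
```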
We can override the labels:
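A sketch of a labels override, again on a made-up stand-in data frame (pop is hypothetical). The labels argument accepts a function that turns break values into text:

```r
library(ggplot2)

# Hypothetical stand-in for the census data
pop <- data.frame(
  place = c("A", "B", "C", "D"),
  population = c(500, 12000, 250000, 3900000)
)

# Format axis labels with thousands separators instead of scientific notation
comma_labels <- function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)

p <- ggplot(pop, aes(x = place, y = population)) +
  geom_point() +
  scale_y_log10(labels = comma_labels)
p
```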
Or change the breaks:
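And a sketch of custom breaks, using the same made-up stand-in data (pop is hypothetical, and the break values are illustrative):

```r
library(ggplot2)

# Hypothetical stand-in for the census data
pop <- data.frame(
  place = c("A", "B", "C", "D"),
  population = c(500, 12000, 250000, 3900000)
)

# breaks controls where the axis ticks and gridlines are placed
p <- ggplot(pop, aes(x = place, y = population)) +
  geom_point() +
  scale_y_log10(breaks = c(1e3, 1e4, 1e5, 1e6))
p
```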
Facets allow us to perform a powerful visualization called a small multiple:
http://www.latimes.com/local/lanow/la-me-g-california-drought-map-htmlstory.html
I use small multiples all the time when I have a variable like a site or year and I want to quickly compare across years.
Let’s compare highway fuel economy versus engine displacement across our two samples:
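A sketch, assuming the two samples are the two model years recorded in the year column of mpg:

```r
library(ggplot2)

# One panel per model year (the faceting variable is an assumption)
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ year)
p
```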
Or fuel economy versus engine displacement across manufacturer:
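The same sketch, faceted by the manufacturer column of the assumed mpg data:

```r
library(ggplot2)

# One panel per manufacturer
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ manufacturer)
p
```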
ggplot2 offers us a very high level of customizability in what I think is a fairly easy-to-discover and easy-to-remember way, with the theme function and pre-set themes.
Let’s use a theme other than the default:
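For example, with theme_classic, one of the pre-set themes that ships with ggplot2. Which theme to use, and the mpg plot itself, are my choices for this sketch:

```r
library(ggplot2)

# Pre-set themes are added to a plot like any other component
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_classic()
p
```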
Let’s change the way the legend displays:
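A sketch using the theme function. Moving the legend to the bottom and dropping its title are illustrative choices:

```r
library(ggplot2)

# theme() tweaks individual elements on top of the pre-set theme
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_classic() +
  theme(legend.position = "bottom",
        legend.title = element_blank())
p
```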
Let’s adjust our axis labels and title:
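A sketch using labs. The label text is made up for illustration:

```r
library(ggplot2)

# labs() sets the axis labels and the plot title in one call
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_classic() +
  labs(x = "Engine displacement (L)",
       y = "Highway fuel economy (mpg)",
       title = "Highway fuel economy by engine displacement")
p
```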
Challenge: Look at the help for ?theme and try changing something else about the above plot.
More themes at https://github.com/jrnold/ggthemes
Let’s save that great plot we just made. Saving plots in ggplot is done with the ggsave() function:
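A minimal sketch. The plot and the file name are assumptions, and the file goes to R's temporary directory here so nothing lands in your project:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output path; ggsave picks PNG from the .png extension
out_file <- file.path(tempdir(), "hwy_vs_displ.png")
ggsave(out_file, plot = p)
```

If you leave out the plot argument, ggsave saves the last plot that was displayed.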
ggsave automatically chooses the format based on your file extension and guesses a default image size. We can customize the size with the width and height arguments:
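For example, with an assumed plot and file name (width and height are in inches by default):

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Hypothetical output path; the .pdf extension selects the PDF format
out_file <- file.path(tempdir(), "hwy_vs_displ.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```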
## OGR data source with driver: ESRI Shapefile
## Source: "../publishing-maps-to-the-web-in-r/data/cb_2016_us_state_20m/cb_2016_us_state_20m.shp", layer: "cb_2016_us_state_20m"
## with 52 features
## It has 9 fields
Challenge: Reproject the data before plotting